Importing modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.offline as pyo
pyo.init_notebook_mode()
data = pd.read_csv('dataset.csv')
data.sample(5)
| Date | Store_City | Store_Location | Store_Name | Product_Category | Product_Cost | Product_Price | Product_Name | Items_Sold | |
|---|---|---|---|---|---|---|---|---|---|
| 46101 | 2018-01-14 | Aguascalientes | Downtown | Maven Toys Aguascalientes 1 | Games | $3.99 | $6.99 | Deck Of Cards | 1 |
| 523463 | 2017-01-24 | Mexicali | Downtown | Maven Toys Mexicali 2 | Sports & Outdoors | $11.99 | $15.99 | Dart Gun | 3 |
| 631003 | 2018-02-23 | Guadalajara | Downtown | Maven Toys Guadalajara 4 | Art & Crafts | $1.99 | $2.99 | PlayDoh Can | 1 |
| 52425 | 2017-02-22 | Guadalajara | Residential | Maven Toys Guadalajara 1 | Games | $3.99 | $6.99 | Deck Of Cards | 1 |
| 530100 | 2018-02-18 | Puebla | Downtown | Maven Toys Puebla 2 | Toys | $10.99 | $14.99 | Dinosaur Figures | 1 |
This data is not aggregated or in other words this data is not summerized.
Each Row represents:
Sale Information : Date, Units
Store Information: Store_City, Store_Location, Store_Name
Product Information : Product_Category, Product_Name, Product_Cost, Product_Price
There's Hierarchical Information

data['Product_Cost'] = data.Product_Cost.str.replace('$','',regex=False).astype(float)
data['Product_Price'] = data.Product_Price.str.replace('$','',regex=False).astype(float)
converting Date column to datetime dtype
data['Date'] = pd.to_datetime(data.Date)
let's check the data again
print(data.shape)
data.sample(5)
(829262, 9)
| Date | Store_City | Store_Location | Store_Name | Product_Category | Product_Cost | Product_Price | Product_Name | Items_Sold | |
|---|---|---|---|---|---|---|---|---|---|
| 354135 | 2018-02-10 | Monterrey | Downtown | Maven Toys Monterrey 2 | Art & Crafts | 1.99 | 3.99 | Barrel O' Slime | 3 |
| 555259 | 2017-06-18 | Puebla | Residential | Maven Toys Puebla 3 | Games | 5.99 | 10.99 | Glass Marbles | 1 |
| 185401 | 2018-08-16 | Guadalajara | Residential | Maven Toys Guadalajara 1 | Electronics | 6.99 | 14.99 | Colorbuds | 1 |
| 729877 | 2018-06-30 | Hermosillo | Residential | Maven Toys Hermosillo 1 | Art & Crafts | 13.99 | 19.99 | Kids Makeup Kit | 1 |
| 742180 | 2018-06-19 | Guadalajara | Airport | Maven Toys Guadalajara 3 | Toys | 8.99 | 19.99 | Plush Pony | 1 |
How many sale records do we have?
data.shape
(829262, 9)
Let's draw the histogram of Store_location column
data.Store_Location.hist()
<AxesSubplot:>
There are four different store locations in our data.
Seems like the stores located in Downtown have the highest number of sale records and the stores located in Airport have the lowest.
Let's draw the histogram of Product_Category column
data.Product_Category.hist()
<AxesSubplot:>
What are the insights?
Draw the Product_cost Historgram
data.Product_Cost.hist()
<AxesSubplot:>
Draw the Product_Price Historgram
data.Product_Price.hist()
<AxesSubplot:>
Let's calculate a column called "Profit" to calculate pure Income of each sale.
data['Profit'] = (data['Product_Price'] - data['Product_Cost']) * data['Items_Sold']
data.sample(5)
| Date | Store_City | Store_Location | Store_Name | Product_Category | Product_Cost | Product_Price | Product_Name | Items_Sold | Profit | |
|---|---|---|---|---|---|---|---|---|---|---|
| 822655 | 2018-07-19 | Xalapa | Downtown | Maven Toys Xalapa 2 | Sports & Outdoors | 8.99 | 11.99 | Foam Disk Launcher | 1 | 3.0 |
| 640036 | 2017-12-26 | Culiacan | Downtown | Maven Toys Culiacan 1 | Art & Crafts | 1.99 | 2.99 | PlayDoh Can | 2 | 2.0 |
| 330749 | 2018-09-15 | Chihuahua | Downtown | Maven Toys Chihuahua 2 | Art & Crafts | 1.99 | 3.99 | Barrel O' Slime | 1 | 2.0 |
| 41456 | 2017-05-11 | Cuernavaca | Downtown | Maven Toys Cuernavaca 1 | Sports & Outdoors | 7.99 | 8.99 | Splash Balls | 1 | 1.0 |
| 307753 | 2017-12-20 | Hermosillo | Downtown | Maven Toys Hermosillo 2 | Sports & Outdoors | 6.99 | 9.99 | Mini Ping Pong Set | 1 | 3.0 |
Let's say we want to analyze the data in Santiago city
Let's calculate the daily sale profits in Santiago
data_Santiago = data[data['Store_City']=='Santiago']
data_Santiago=data_Santiago[['Date', 'Profit']]
data_Santiago = data_Santiago.sort_values('Date')
data_Santiago.head()
| Date | Profit | |
|---|---|---|
| 257208 | 2017-01-01 | 6.0 |
| 722768 | 2017-01-01 | 12.0 |
| 722769 | 2017-01-01 | 12.0 |
| 722770 | 2017-01-01 | 12.0 |
| 722771 | 2017-01-01 | 18.0 |
data_Santiago_daily = data_Santiago.groupby('Date').sum().reset_index()
data_Santiago_daily.head()
| Date | Profit | |
|---|---|---|
| 0 | 2017-01-01 | 246.0 |
| 1 | 2017-01-02 | 32.0 |
| 2 | 2017-01-04 | 57.0 |
| 3 | 2017-01-05 | 148.0 |
| 4 | 2017-01-06 | 144.0 |
let's visualize this data using a line chart
fig = px.line(data_Santiago_daily, x='Date', y='Profit')
fig.show()
We can change the granularity level by aggregating our data based on YEAR and MONTH

Each year has different months and each months has different days.
What if we have the following analytical question:
Compare Monthly profit in 2017 and 2018 and find the the Month with the highest Monthly profit
Let's create a column called MONTH and extract the month value from the Date column
data_Santiago['Year'] = pd.DatetimeIndex(data_Santiago['Date']).year
data_Santiago['Month'] = pd.DatetimeIndex(data_Santiago['Date']).month
data_Santiago_monthly = data_Santiago.groupby(["Year","Month"]).sum().reset_index()
data_Santiago_monthly['Year-Month'] = data_Santiago_monthly.Year.astype(str) + '-'+data_Santiago_monthly.Month.astype(str)
let's check our data again
data_Santiago_monthly.head()
| Year | Month | Profit | Year-Month | |
|---|---|---|---|---|
| 0 | 2017 | 1 | 3185.0 | 2017-1 |
| 1 | 2017 | 2 | 2285.0 | 2017-2 |
| 2 | 2017 | 3 | 3212.0 | 2017-3 |
| 3 | 2017 | 4 | 4214.0 | 2017-4 |
| 4 | 2017 | 5 | 3421.0 | 2017-5 |
Let's use a bar cahrt to visualize the results
fig = px.bar(data_Santiago_monthly, x='Year-Month', y='Profit', title = 'Santiago monthly profit')
fig.show()
Let's Answer to some Analytical question.

- Aggregating our data by Product_Name and sum up the profits.
Q1 = data[['Product_Name', 'Profit']].groupby('Product_Name').sum().reset_index()
Q1.head()
| Product_Name | Profit | |
|---|---|---|
| 0 | Action Figure | 347748.0 |
| 1 | Animal Figures | 117267.0 |
| 2 | Barrel O' Slime | 183326.0 |
| 3 | Chutes & Ladders | 11487.0 |
| 4 | Classic Dominoes | 8942.0 |
Let's use a bar chart to visualize the data
fig = px.bar(Q1, x='Product_Name', y='Profit')
fig.show()
- We have to Aggregating our data by Product_Category and sum up the profits
- It means we are going to summerize our data to Product Category Level.
Q2 = data[['Product_Category', 'Profit']].groupby('Product_Category').sum().reset_index()
Q2.head()
| Product_Category | Profit | |
|---|---|---|
| 0 | Art & Crafts | 753354.0 |
| 1 | Electronics | 1001437.0 |
| 2 | Games | 673993.0 |
| 3 | Sports & Outdoors | 505718.0 |
| 4 | Toys | 1079527.0 |
Let's use a bar chart to visualize the data
fig = px.bar(Q2, x='Product_Category', y='Profit')
fig.show()
- We have to Aggregating our data by Product_Name and Product_Category and sum up the profits.
Q3 = data[['Product_Category', 'Product_Name','Profit']].groupby(['Product_Category','Product_Name']).sum().reset_index()
Q3.head()
| Product_Category | Product_Name | Profit | |
|---|---|---|---|
| 0 | Art & Crafts | Barrel O' Slime | 183326.0 |
| 1 | Art & Crafts | Etch A Sketch | 121680.0 |
| 2 | Art & Crafts | Kids Makeup Kit | 146598.0 |
| 3 | Art & Crafts | Magic Sand | 121196.0 |
| 4 | Art & Crafts | PlayDoh Can | 103128.0 |
fig = px.bar(Q3, x='Product_Name', y='Profit', color = 'Product_Category')
fig.show()
fig = px.bar(Q3, x='Product_Category', y='Profit', color = 'Product_Name')
fig.show()
Q5 = data[['Store_Location','Product_Category','Profit']].groupby(['Store_Location','Product_Category']).sum().reset_index()
Q5.head()
| Store_Location | Product_Category | Profit | |
|---|---|---|---|
| 0 | Airport | Art & Crafts | 61441.0 |
| 1 | Airport | Electronics | 108197.0 |
| 2 | Airport | Games | 80768.0 |
| 3 | Airport | Sports & Outdoors | 39393.0 |
| 4 | Airport | Toys | 88250.0 |
fig = px.bar(Q5, x='Product_Category', y='Profit', color = 'Store_Location')
fig.show()
Let's visualize the same data using sunburst diagram
fig = px.sunburst(Q5, path=['Store_Location','Product_Category'], values = 'Profit')
fig.show()
Let's visualize the same data using Treemap diagram
fig = px.treemap(Q5, path = ['Store_Location','Product_Category'], values = 'Profit')
fig.show()
Let's Visualize the original data using three map
data
path = ['Store_City','Store_Location','Store_Name','Product_Category','Product_Name']
fig = px.treemap(data, path = path, values = 'Profit')
fig.show()